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Abstract 

We present BayeSum (for "Bayesian 
summarization"), a model for sentence ex- 
traction in query-focused summarization. 
BayeSum leverages the common case in 
which multiple documents are relevant to a 
single query. Using these documents as re- 
inforcement for query terms, BayeSum is 
not afflicted by the paucity of information 
in short queries. We show that approxi- 
mate inference in BayeSum is possible 
on large data sets and results in a state- 
of-the-art summarization system. Further- 
more, we show how BayeSum can be 
understood as a justified query expansion 
technique in the language modeling for IR 
framework. 

1 Introduction 

We describe BayeSum, an algorithm for perform- 
ing query-focused summarization in the common 
case that there are many relevant documents for a 
given query. Given a query and a collection of rel- 
evant documents, our algorithm functions by ask- 
ing itself the following question: what is it about 
these relevant documents that differentiates them 
from the non-relevant documents? BayeSum can 
be seen as providing a statistical formulation of 
this exact question. 

The key requirement of BayeSum is that mul- 
tiple relevant documents are known for the query 
in question. This is not a severe limitation. In two 
well-studied problems, it is the de-facto standard. 
In standard multidocument summarization (with 
or without a query), we have access to known rel- 
evant documents for some user need. Similarly, in 
the case of a web-search application, an underly- 
ing IR engine will retrieve multiple (presumably) 



relevant documents for a given query. For both of 
these tasks, BayeSum performs well, even when 
the underlying retrieval model is noisy. 

The idea of leveraging known relevant docu- 
ments is known as query expansion in the informa- 
tion retrieval community, where it has been shown 
to be successful in ad hoc retrieval tasks. Viewed 
from the perspective of IR, our work can be inter- 
preted in two ways. First, it can be seen as an ap- 
plication of query expansion to the summarization 
task (or, in IR terminology, passage retrieval); see 
(Liu and Croft, 2002; Murdock and Croft, 2005). 
Second, and more importantly, it can be seen as a 
method for query expansion in a non-ad-hoc man- 
ner. That is, BayeSum is a statistically justified 
query expansion method in the language modeling 
for IR framework (Ponte and Croft, 1998). 

2 Bayesian Query-Focused 
Summarization 

In this section, we describe our Bayesian query- 
focused summarization model (BayeSum). This 
task is very similar to the standard ad-hoc IR task, 
with the important distinction that we are compar- 
ing query models against sentence models, rather 
than against document models. The shortness of 
sentences means that one must do a good job of 
creating the query models. 

To maintain generality, so that our model is ap- 
plicable to any problem for which multiple rele- 
vant documents are known for a query, we formu- 
late our model in terms of relevance judgments. 
For a collection of D documents and Q queries, 
we assume we have a D x Q binary matrix r, 
where r^ q = 1 if an only if document d is rele- 
vant to query q. In multidocument summarization, 
r^q will be 1 exactly when d is in the document set 
corresponding to query q; in search-engine sum- 



marization, it will be 1 exactly when d is returned 
by the search engine for query q. 

2.1 Language Modeling for IR 

BayeSum is built on the concept of language 
models for information retrieval. The idea behind 
the language modeling techniques used in IR is 
to represent either queries or documents (or both) 
as probability distributions, and then use stan- 
dard probabilistic techniques for comparing them. 
These probability distributions are almost always 
"bag of words" distributions that assign a proba- 
bility to words from a fixed vocabulary V. 

One approach is to build a probability distri- 
bution for a given document, Pd(), and to look 
at the probability of a query under that distribu- 
tion: Pd(q). Documents are ranked according to 
how likely they make the query (Ponte and Croft, 
1998). Other researchers have built probability 
distributions over queries p q {-) and ranked doc- 
uments according to how likely they look under 
the query model: p q (d) (Lafferty and Zhai, 2001). 
A third approach builds a probability distribution 
Pq(-) for the query, a probability distribution Pd(-) 
for the document and then measures the similarity 
between these two distributions using KL diver- 
gence (Lavrenko et al., 2002): 

KL(p q \\p d ) = Y,P q W log ^TT (1) 

The KL divergence between two probability 
distributions is zero when they are identical and 
otherwise strictly positive. It implicitly assumes 
that both distributions p q and pd have the same 
support: they assign non-zero probability to ex- 
actly the same subset of V; in order to account 
for this, the distributions p q and pd are smoothed 
against a background general English model. This 
final mode — the KL model — is the one on which 
BayeSum is based. 

2.2 Bayesian Statistical Model 

In the language of information retrieval, the query- 
focused sentence extraction task boils down to es- 
timating a good query model, p q (-). Once we have 
such a model, we could estimate sentence models 
for each sentence in a relevant document, and rank 
the sentences according to Eq (1). 

The BayeSum system is based on the follow- 
ing model: we hypothesize that a sentence ap- 
pears in a document because it is relevant to some 



query, because it provides background informa- 
tion about the document (but is not relevant to a 
known query) or simply because it contains use- 
less, general English filler. Similarly, we model 
each word as appearing for one of those purposes. 
More specifically, our model assumes that each 
word can be assigned a discrete, exact source, such 
as "this word is relevant to query q\" or "this word 
is general English." At the sentence level, how- 
ever, sentences are assigned degrees: "this sen- 
tence is 60% about query q±, 30% background 
document information, and 10% general English." 

To model this, we define a general English 
language model, p (•) to capture the English 
filler. Furthermore, for each document c4, we 
define a background document language model, 
p dk {-); similarly, for each query qj, we define 
a query-specific language model p qj (-). Every 
word in a document d^ is modeled as being gen- 
erated from a mixture of p G , p dk and {p q i : 
query qj is relevant to document d^}. Supposing 
there are J total queries and K total documents, 
we say that the nth word from the sth sentence 
in document d, Wd. S n, nas a corresponding hidden 
variable, Zd sn that specifies exactly which of these 
distributions is used to generate that one word. In 
particular, z dsn is a vector of length 1 + J + K, 
where exactly one element is 1 and the rest are 0. 

At the sentence level, we introduce a second 
layer of hidden variables. For the sth sentence in 
document d, we let 7T ds be a vector also of length 
1 + J + K that represents our degree of belief 
that this sentence came from any of the models. 
The 7Td s s lie in the J + iv'-dimensional simplex 
A J+K = {6 = (6 1 ,...,9j +K+1 ) : (Vi) 9i > 
0' J2i ®i = !}• The interpretation of the it vari- 
ables is that if the "general English" component of 
7T is 0.9, then 90% of the words in this sentence 
will be general English. The tt and z variables are 
constrained so that a sentence cannot be generated 
by a document language model other than its own 
document and cannot be generated by a query lan- 
guage model for a query to which it is not relevant. 

Since the 7rs are unknown, and it is unlikely that 
there is a "true" correct value, we place a corpus- 
level prior on them. Since tt is a multinomial dis- 
tribution over its corresponding zs, it is natural to 
use a Dirichlet distribution as a prior over tt. A 
Dirichlet distribution is parameterized by a vector 
a of equal length to the corresponding multino- 
mial parameter, again with the positivity restric- 



tion, but no longer required to sum to one. It 
has continuous density over a variable 6±,...,9i 

given by: Ur{0 | a) = ^g)^^" 1 - The 
first term is a normalization term that ensures that 
f A1 dO Vir{6 | a) = 1. 

2.3 Generative Story 

The generative story for our model defines a distri- 
bution over a corpus of queries, {qj}i- j, and doc- 
uments, {dk}i-.K, as follows: 

1 . For each query j = 1 ... J: Generate each 
word q jn in qj by (q jn ) 

2. For each document k = 1 . . . K and each 
sentence s in document k: 

(a) Select the current sentence degree ir^s 
by Vir(ir ks \ a)r k (ir ks ) 

(b) For each word w ksn in sentence s: 

• Select the word source z ks n accord- 
ing tO A4ult(z | 7T ks ) 

• Generate the word w ksn by 



P G {w k sn) 
P dk (Wksn) 
P qj {Wksn) 



if z ksn = 

if Zksn = k + 1 

if z ksn = j + K + 1 



We used r to denote relevance judgments: 
r k{^) = if any document component of tt ex- 
cept the one corresponding to k is non-zero, or if 
any query component of tt except those queries to 
which document k is deemed relevant is non-zero 
(this prevents a document using the "wrong" doc- 
ument or query components). We have further as- 
sumed that the z vector is laid out so that z$ cor- 
responds to general English, z k +i corresponds to 
document d k for < j < J and that Zj+K+i cor- 
responds to query qj for < k < K. 

2.4 Graphical Model 

The graphical model corresponding to this gener- 
ative story is in Figure 1. This model depicts the 
four known parameters in square boxes (a, p®, p D 
and p G ) with the three observed random variables 
in shaded circles (the queries q, the relevance judg- 
ments r and the words w) and two unobserved ran- 
dom variables in empty circles (the word-level in- 
dicator variables z and the sentence level degrees 
it). The rounded plates denote replication: there 
are J queries and K documents, containing S sen- 
tences in a given document and N words in a given 
sentence. The joint probability over the observed 
random variables is given in Eq (2): 




Figure 1: Graphical model for the Bayesian 
Query-Focused Summarization Model. 
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This expression computes the probability of the 
data by integrating out the unknown variables. In 
the case of the tt variables, this is accomplished 
by integrating over A, the multinomial simplex, 
according to the prior distribution given by a. In 
the case of the z variables, this is accomplished by 
summing over all possible (discrete) values. The 
final word probability is conditioned on the z value 
by selecting the appropriate distribution from p G , 
p D andp^. Computing this expression and finding 
optimal model parameters is intractable due to the 
coupling of the variables under the integral. 

3 Statistical Inference in BayeSum 

Bayesian inference problems often give rise to in- 
tractable integrals, and a large variety of tech- 
niques have been proposed to deal with this. The 
most popular are Markov Chain Monte Carlo 
(MCMC), the Laplace (or saddle-point) approxi- 
mation and the variational approximation. A third, 
less common, but very effective technique, espe- 
cially for dealing with mixture models, is expec- 
tation propagation (Minka, 2001). In this paper, 
we will focus on expectation propagation; exper- 
iments not reported here have shown variational 



EM to perform comparably but take roughly 50% 
longer to converge. 

Expectation propagation (EP) is an inference 
technique introduced by Minka (2001) as a gener- 
alization of both belief propagation and assumed 
density filtering. In his thesis, Minka showed 
that EP is very effective in mixture modeling 
problems, and later demonstrated its superiority 
to variational techniques in the Generative As- 
pect Model (Minka and Lafferty, 2003). The key 
idea is to compute an integral of a product of 
terms by iteratively applying a sequence of "dele- 
tion/inclusion" steps. Given an integral of the 
form: J A d7r p(ir) Y[ n t n (ir), EP approximates 
each term t n by a simpler term t n , giving Eq (3). 

/ dTT^vr) q(7r) = p(7v)T\t n (7v) (3) 

In each deletion/inclusion step, one of the ap- 
proximate terms is deleted from q(-), leaving 
q~ n (-) = q(-)/i n {-). A new approximation for 
£„(•) is computed so that t n {-)q~ n {-) has the same 
integral, mean and variance as t n (-)q~ n (-). This 
new approximation, t n {-) is then included back 
into the full expression for q(-) and the process re- 
peats. This algorithm always has a fixed point and 
there are methods for ensuring that the approxi- 
mation remains in a location where the integral is 
well-defined. Unlike variational EM, the approx- 
imation given by EP is global, and often leads to 
much more reliable estimates of the true integral. 

In the case of our model, we follow Minka and 
Lafferty (2003), who adapts latent Dirichlet allo- 
cation of Blei et al. (2003) to EP. Due to space 
constraints, we omit the inference algorithms and 
instead direct the interested reader to the descrip- 
tion given by Minka and Lafferty (2003). 

4 Search-Engine Experiments 

The first experiments we run are for query-focused 
single document summarization, where relevant 
documents are returned from a search engine, and 
a short summary is desired of each document. 

4.1 Data 

The data we use to train and test BayeSum 
is drawn from the Text REtrieval Conference 
(TREC) competitions. This data set consists of 
queries, documents and relevance judgments, ex- 
actly as required by our model. The queries are 



typically broken down into four fields of increas- 
ing length: the title (3-4 words), the summary (1 
sentence), the narrative (2-4 sentences) and the 
concepts (a list of keywords). Obviously, one 
would expect that the longer the query, the better 
a model would be able to do, and this is borne out 
experimentally (Section 4.5). 

Of the TREC data, we have trained our model 
on 350 queries (queries numbered 51-350 and 
401-450) and all corresponding relevant docu- 
ments. This amounts to roughly 43k documents, 
2.1m sentences and 65.8m words. The mean 
number of relevant documents per query is 137 
and the median is 81 (the most prolific query has 
968 relevant documents). On the other hand, each 
document is relevant to, on average, 1.11 queries 
(the median is 5.5 and the most generally relevant 
document is relevant to 20 different queries). In all 
cases, we apply stemming using the Porter stem- 
mer; for all other models, we remove stop words. 

In order to evaluate our model, we had 
seven human judges manually perform the query- 
focused sentence extraction task. The judges were 
supplied with the full TREC query and a single 
document relevant to that query, and were asked to 
select up to four sentences from the document that 
best met the needs given by the query. Each judge 
annotated 25 queries with some overlap to allow 
for an evaluation of inter-annotator agreement, 
yielding annotations for a total of 166 unique 
query/document pairs. On the doubly annotated 
data, we computed the inter-annotator agreement 
using the kappa measure. The kappa value found 
was 0.58, which is low, but not abysmal (also, 
keep in mind that this is computed over only 25 
of the 166 examples). 

4.2 Evaluation Criteria 

Since there are differing numbers of sentences se- 
lected per document by the human judges, one 
cannot compute precision and recall; instead, we 
opt for other standard IR performance measures. 
We consider three related criteria: mean average 
precision (MAP), mean reciprocal rank (MRR) 
and precision at 2 (P@2). MAP is computed by 
calculating precision at every sentence as ordered 
by the system up until all relevant sentences are se- 
lected and averaged. MRR is the reciprocal of the 
rank of the first relevant sentence. P@2 is the pre- 
cision computed at the first point that two relevant 
sentences have been selected (in the rare case that 



humans selected only one sentence, we use P@l). 



4.3 Baseline Models 

As baselines, we consider four strawman models 
and two state-of-the-art information retrieval mod- 
els. The first strawman, RANDOM ranks sentences 
randomly. The second strawman, POSITION, 
ranks sentences according to their absolute posi- 
tion (in the context of non-query-focused summa- 
rization, this is an incredibly powerful baseline). 
The third and fourth models are based on the vec- 
tor space interpretation of IR. The third model, 
JACCARD, uses standard Jaccard distance score 
(intersection over union) between each sentence 
and the query to rank sentences. The fourth, CO- 
SINE, uses TF-IDF weighted cosine similarity. 

The two state-of-the-art IR models used as com- 
parative systems are based on the language mod- 
eling framework described in Section 2.1. These 
systems compute a language model for each query 
and for each sentence in a document. Sentences 
are then ranked according to the KL divergence 
between the query model and the sentence model, 
smoothed against a general model estimated from 
the entire collection, as described in the case of 
document retrieval by Lavrenko et al. (2002). This 
is the first system we compare against, called KL. 

The second true system, KL+Rel is based on 
augmenting the KL system with blind relevance 
feedback (query expansion). Specifically, we first 
run each query against the document set returned 
by the relevance judgments and retrieve the top n 
sentences. We then expand the query by interpo- 
lating the original query model with a query model 
estimated on these sentences. This serves as a 
method of query expansion. We ran experiments 
ranging n in {5, 10, 25, 50, 100} and the interpo- 
lation parameter A in {0.2, 0.4, 0.6, 0.8} and used 
oracle selection (on MRR) to choose the values 
that performed best (the results are thus overly op- 
timistic). These values were n = 25 and A = 0.4. 

Of all the systems compared, only BayeSum 
and the KL+Rel model use the relevance judg- 
ments; however, they both have access to exactly 
the same information. The other models only run 
on the subset of the data used for evaluation (the 
corpus language model for the KL system and the 
IDF values for the COSINE model are computed 
on the full data set). EP ran for 2.5 hours. 
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Table 1 : Empirical results for the baseline models 
as well as BayeSum, when all query fields are 
used. 

4.4 Performance on all Query Fields 

Our first evaluation compares results when all 
query fields are used (title, summary, description 
and concepts 1 ). These results are shown in Ta- 
ble 1. As we can see from these results, the JAC- 
CARD system alone is not sufficient to beat the 
position-based baseline. The COSINE does beat 
the position baseline by a bit of a margin (5 points 
better in MAP, 9 points in MRR and 4 points in 
P@2), and is in turn beaten by the KL system 
(which is 7 points, 14 points and 4 points better 
in MAP, MRR and P@2, respectively). Blind rel- 
evance feedback (parameters of which were cho- 
sen by an oracle to maximize the P@2 metric) ac- 
tually hurts MAP and MRR performance by 0.3 
and 1.2, respectively, and increases P@2 by 1.5. 
Over the best performing baseline system (either 
KL or KL+Rel), BayeSum wins by a margin of 

7.5 points in MAP, 6.7 for MRR and 4.4 for P@2. 

4.5 Varying Query Fields 

Our next experimental comparison has to do with 
reducing the amount of information given in the 
query. In Table 2, we show the performance 
of the KL, KL-Rel and BayeSum systems, as 
we use different query fields. There are several 
things to notice in these results. First, the stan- 
dard KL model without blind relevance feedback 
performs worse than the position-based model 
when only the 3-4 word title is available. Sec- 
ond, BayeSum using only the title outperform 
the KL model with relevance feedback using all 
fields. In fact, one can apply BayeSum without 
using any of the query fields; in this case, only the 
relevance judgments are available to make sense 



'A reviewer pointed out that concepts were later removed 
from TREC because they were "too good." Section 4.5 con- 
siders the case without the concepts field. 
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Table 2: Empirical results for the position-based 
model, the KL-based models and BayeSum, with 
different inputs. 



of what the query might be. Even in this cir- 
cumstance, BayeSum achieves a MAP of 39.4, 
an MRR of 64.7 and a P@2 of 30.4, still bet- 
ter across the board than KL-Rel with all query 
fields. While initially this seems counterintuitive, 
it is actually not so unreasonable: there is signifi- 
cantly more information available in several hun- 
dred positive relevance judgments than in a few 
sentences. However, the simple blind relevance 
feedback mechanism so popular in IR is unable to 
adequately model this. 

With the exception of the KL model without rel- 
evance feedback, adding the description on top of 
the title does not seem to make any difference for 
any of the models (and, in fact, occasionally hurts 
according to some metrics). Adding the summary 
improves performance in most cases, but not sig- 
nificantly. Adding concepts tends to improve re- 
sults slightly more substantially than any other. 

4.6 Noisy Relevance Judgments 

Our model hinges on the assumption that, for a 
given query, we have access to a collection of 
known relevant documents. In most real-world 
cases, this assumption is violated. Even in multi- 
document summarization as run in the DUC com- 
petitions, the assumption of access to a collection 
of documents all relevant to a user need is unreal- 
istic. In the real world, we will have to deal with 
document collections that "accidentally" contain 
irrelevant documents. The experiments in this sec- 
tion show that BayeSum is comparatively robust. 

For this experiment, we use the IR engine that 
performed best in the TREC 1 evaluation: In- 
query (Callan et al., 1992). We used the offi- 
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Figure 2: Performance with noisy relevance judg- 
ments. The X-axis is the R-precision of the IR 
engine and the Y-axis is the summarization per- 
formance in MAP. Solid lines are BayeSum, dot- 
ted lines are KL-Rel. Blue/stars indicate title only, 
red/circles indicated title+description+summary 
and black/pluses indicate all fields. 

cial TREC results of Inquery on the subset of 
the TREC corpus we consider. The Inquery R- 
precision on this task is 0.39 using title only, and 
0.51 using all fields. In order to obtain curves 
as the IR engine improves, we have linearly in- 
terpolated the Inquery rankings with the true rel- 
evance judgments. By tweaking the interpolation 
parameter, we obtain an IR engine with improv- 
ing performance, but with a reasonable bias. We 
have run both BayeSum and KL-Rel on the rel- 
evance judgments obtained by this method for six 
values of the interpolation parameter. The results 
are shown in Figure 2. 

As we can observe from the figure, the solid 
lines (BayeSum) are always above the dotted 
lines (KL-Rel). Considering the KL-Rel results 
alone, we can see that for a non-perfect IR engine, 
it makes little difference what query fields we use 
for the summarization task: they all obtain roughly 
equal scores. This is because the performance in 
KL-Rel is dominated by the performance of the IR 
engine. Looking only at the BayeSum results, we 
can see a much stronger, and perhaps surprising 
difference. For an imperfect IR system, it is better 
to use only the title than to use the title, description 
and summary for the summarization component. 
We believe this is because the title is more on topic 
than the other fields, which contain terms like "A 
relevant document should describe " Never- 



theless, BayeSum has a more upward trend than 
KL-Rel, which indicates that improved IR will re- 
sult in improved summarization for BayeSum but 
not for KL-Rel. 

5 Multidocument Experiments 

We present two results using BayeSum in the 
multidocument summarization settings, based on 
the official results from the Multilingual Summa- 
rization Evaluation (MSE) and Document Under- 
standing Conference (DUC) competitions in 2005. 

5.1 Performance at MSE 2005 

We participated in the Multilingual Summariza- 
tion Evaluation (MSE) workshop with a system 
based on BayeSum. The task for this competi- 
tion was generic (no query) multidocument sum- 
marization. Fortunately, not having a query is 
not a hindrance to our model. To account for the 
redundancy present in document collections, we 
applied a greedy selection technique that selects 
sentences central to the document cluster but far 
from previously selected sentences (Daume III and 
Marcu, 2005a). In MSE, our system performed 
very well. According to the human "pyramid" 
evaluation, our system came first with a score of 
0.529; the next best score was 0.489. In the au- 
tomatic "Basic Element" evaluation, our system 
scored 0.0704 (with a 95% confidence interval of 
[0.0429,0.1057]), which was the third best score 
on a site basis (out of 10 sites), and was not statis- 
tically significantly different from the best system, 
which scored 0.0981. 

5.2 Performance at DUC 2005 

We also participated in the Document Understand- 
ing Conference (DUC) competition. The chosen 
task for DUC was query-focused multidocument 
summarization. We entered a nearly identical sys- 
tem to DUC as to MSE, with an additional rule- 
based sentence compression component (Daume 
III and Marcu, 2005b). Human evaluators consid- 
ered both responsiveness (how well did the sum- 
mary answer the query) and linguistic quality. Our 
system achieved the highest responsiveness score 
in the competition. We scored more poorly on the 
linguistic quality evaluation, which (only 5 out of 
about 30 systems performed worse); this is likely 
due to the sentence compression we performed on 
top of BayeSum. On the automatic Rouge-based 
evaluations, our system performed between third 



and sixth (depending on the Rouge parameters), 
but was never statistically significantly worse than 
the best performing systems. 

6 Discussion and Future Work 

In this paper we have described a model for au- 
tomatically generating a query-focused summary, 
when one has access to multiple relevance judg- 
ments. Our Bayesian Query-Focused Summariza- 
tion model (BayeSum) consistently outperforms 
contending, state of the art information retrieval 
models, even when it is forced to work with sig- 
nificantly less information (either in the complex- 
ity of the query terms or the quality of relevance 
judgments documents). When we applied our sys- 
tem as a stand-alone summarization model in the 
2005 MSE and DUC tasks, we achieved among 
the highest scores in the evaluation metrics. The 
primary weakness of the model is that it currently 
only operates in a purely extractive setting. 

One question that arises is: why does 
BayeSum so strongly outperform KL-Rel, given 
that BayeSum can be seen as Bayesian formalism 
for relevance feedback (query expansion)? Both 
models have access to exactly the same informa- 
tion: the queries and the true relevance judgments. 
This is especially interesting due to the fact that 
the two relevance feedback parameters for KL- 
Rel were chosen optimally in our experiments, yet 
BayeSum consistently won out. One explanation 
for this performance win is that BayeSum pro- 
vides a separate weight for each word, for each 
query. This gives it significantly more flexibility. 
Doing something similar with ad-hoc query ex- 
pansion techniques is difficult due to the enormous 
number of parameters; see, for instance, (Buckley 
and Salton, 1995). 

One significant advantage of working in the 
Bayesian statistical framework is that it gives us 
a straightforward way to integrate other sources of 
knowledge into our model in a coherent manner. 
One could consider, for instance, to extend this 
model to the multi-document setting, where one 
would need to explicitly model redundancy across 
documents. Alternatively, one could include user 
models to account for novelty or user preferences 
along the lines of Zhang et al. (2002). 

Our model is similar in spirit to the random- 
walk summarization model (Otterbacher et al., 
2005). However, our model has several advan- 
tages over this technique. First, our model has 



no tunable parameters: the random- walk method 
has many (graph connectivity, various thresholds, 
choice of similarity metrics, etc.). Moreover, since 
our model is properly Bayesian, it is straightfor- 
ward to extend it to model other aspects of the 
problem, or to related problems. Doing so in a non 
ad-hoc manner in the random-walk model would 
be nearly impossible. 

Another interesting avenue of future work is to 
relax the bag-of-words assumption. Recent work 
has shown, in related models, how this can be done 
for moving from bag-of-words models to bag-of- 
ngram models (Wallach, 2006); more interesting 
than moving to ngrams would be to move to de- 
pendency parse trees, which could likely be ac- 
counted for in a similar fashion. One could also 
potentially relax the assumption that the relevance 
judgments are known, and attempt to integrate 
them out as well, essentially simultaneously per- 
forming IR and summarization. 
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